---
title: "ShipsImputation"
author: "John Severini"
date: "2024-06-26"
output: html_document
---
```{r}
# Importing the ships to be imputed
library(Amelia)
library(tidyverse)
library(dplyr)
library(readxl)
library(readr)

ships <- read_excel("ships.xlsx")
```

```{r amelia}
# Cleaning
ships.cleaned <- ships %>%
  select(-broadside_imputed, -index, -name, -owner) %>%  # Excluding the ID variables not part of the imputation.
  mutate(
    broadside = as.numeric(broadside),
    disp_norm = as.numeric(disp_norm),
    comp = as.numeric(comp),
    guns = as.numeric(guns),
    launch_year = as.numeric(launch_year),
    year_first = as.numeric(year_first)
  )
ships.cleaned <- as.data.frame(ships.cleaned)

# Creating separate set of ID columns. Note that broadside_imputed will indicate broadsides to copy back to the Ships dataset later. "Broadside" is the only variable that will end up using its imputations, though the other categories will inform the imputation itself.
ships.ids <- ships %>%
  select(index, name, broadside_imputed)

# Setting seed
set.seed(15)

# Creating bounds to ensure non-negative imputed numbers for broadside. Bounds are derived from the existing data. The lowest amount of guns we are imputing for is 14, while the highest is 63. The lowest broadside within that range that we know of is 28, while the largest is 880.

# If you want to impute the other values for your own purposes, you will similarly need to bound the other variables you want to avoid negative numbers. If you change the variables used for imputation, be sure to change the row number at the beginning of the matrix.

#For our purposes currently, broadside is the only variable that will be re-included for descriptive analysis. This is because the unknown, underlying variance of broadside is likely very small compared to that of displacement and complement. Displacement measurements varied between countries, especially during the Age of Sail, and could change by year, shipwright, and geography. Complement, too, varied wildly by country and was often not necessarily representative of the amount of sailors that actually crewed any given ship. Including imputed complement would require layering a second set of assumptions on top of the first. By contrast, broadside is explicitly calculable from the reported number and weight of guns on a ship, which changed only occasionally over time.

bounds <- matrix(c(4, 28, 880), nrow = 1, ncol = 3)
bounds

# Imputing only the variables we want.

# A quick note: "owner" as a nominal variable heavily slows the MI and has not produced a meaningful difference when included in the multiple imputation. You can try it yourself by making sure it is included in the ships.cleaned above. It may have a tendency to create errors in the MI as well, so you may have to try a few times before the variance matrix becomes invertible for all of the imputed datasets. You will also want to lower m because of how long it will take. Best practices in MI usually entail limiting the range of any nominal variable, and so it may also be useful to consider alternative specifications for what owners to exclude.
ships.imputed <- amelia(ships.cleaned,
                #noms = "owner",
                bounds = bounds,
                empri = .05, # Adding a ridge prior for stability due to the high colinear variables.
                m = 5)
ships.imputed
```

```{r}
# Creating the averaged dataset

# Extracting imputed datasets from Amelia output
imputed_datasets <- lapply(1:5, function(i) {
  data <- as.data.frame(ships.imputed$imputations[[i]])
  data <- data[, !names(data) %in% c("owner")]  # Excluding the nominal "owner" column
})

# Computing row means across imputed datasets
ships.imputed.avg <- Reduce(`+`, imputed_datasets) / 5

# Appending excluded columns back to averaged imputed data
ships.imputed.avg <- cbind(ships.imputed.avg, ships.ids[c("index", "name", "broadside_imputed")])

# Writing averaged imputed dataset to CSV. Remember to remove the broadsides that you don't want to include in the Ships dataset. We include the broadside_imputed value for the ones we use.
write.csv(ships.imputed.avg, "ships.imputed.avg.csv", row.names = FALSE, na = "")
```